Generalized unknown morpheme guessing for hybrid POS tagging of Korean
Abstract
Most errors in Korean morphological analysis and POS (part-of-speech) tagging are caused by unknown morphemes. This paper presents a generalized unknown-morpheme handling method with POSTAG (POStech TAGger), a statistical/rule-based hybrid POS tagging system. The generalized unknown-morpheme guessing is based on a combination of a morpheme pattern dictionary, which encodes general lexical patterns of Korean morphemes, with a posteriori syllable tri-gram estimation. The syllable tri-grams help to calculate lexical probabilities of the unknown morphemes and are used to search for the best tagging result. In our scheme, we can guess the POSs of unknown morphemes regardless of their number and position in an eojeol, which was not possible before in Korean tagging systems. In a series of experiments using three different domain corpora, we achieve 97% tagging accuracy despite many unknown morphemes in the test corpora.

* This project was supported by KOSEF (teukjeongkicho #970-1020-301-3, 1997).

1 Introduction

Part-of-speech (POS) tagging has many difficult problems to attack, such as insufficient training data, inherent POS ambiguities, and, most seriously, unknown words. Unknown words are ubiquitous in any application and cause major tagging failures in many cases. Since Korean is an agglutinative language, we face unknown-morpheme problems instead of unknown-word problems in our POS tagging. The usual way of handling unknown morphemes was to guess possible POSs for an unknown morpheme by checking connectable functional morphemes in the same eojeol¹ (Kang, 1993). In this way, one could guess possible POSs for a single unknown morpheme only when it is positioned at the beginning of an eojeol. If an eojeol contains more than one unknown morpheme, or if unknown morphemes appear in positions other than the first, the previous methods cannot efficiently estimate them.
So, we propose a morpheme-pattern dictionary which enables us to treat unknown morphemes in the same way as registered known morphemes, and thereby to guess them regardless of their number and position in an eojeol. The unknown-morpheme handling using the morpheme-pattern dictionary is integrated into a hybrid POS disambiguation.

POS disambiguation has usually been performed by statistical approaches, mainly using hidden Markov models (HMMs) (Cutting et al., 1992; Kupiec, 1992; Weischedel et al., 1993). However, since statistical approaches take into account neighboring tags only within a limited window (usually two or three), sometimes the decision cannot cover all linguistic contexts necessary for POS disambiguation. The approaches are also inappropriate for idiomatic expressions, for which lexical terms need to be directly referenced. Statistical approaches alone are not enough, especially for agglutinative languages (such as Korean), which usually have complex morphological structures. In agglutinative languages, a word (called an eojeol in Korean) usually consists of a single separable stem morpheme plus one or more functional morphemes, and a POS tag should be assigned to each morpheme to cope with the complex morphological phenomena. Recently, rule-based approaches are...

¹ An eojeol is a Korean spacing unit (similar to an English word) which usually consists of one or more stem morphemes and functional morphemes.
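As a rough illustration of how syllable tri-grams can yield a lexical probability for a morpheme that is missing from the dictionary, the sketch below trains per-tag syllable tri-gram counts on a toy lexicon and scores an unseen morpheme under each tag. It is not POSTAG's actual estimator: the class name `SyllableTrigramGuesser`, the tags `NOUN` and `VERB`, the toy lexicon, and the add-alpha smoothing are all assumptions made for this example, and the morpheme pattern dictionary is not modeled at all.

```python
import math
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical boundary markers


class SyllableTrigramGuesser:
    """Toy per-tag syllable tri-gram model with add-alpha smoothing (an assumption)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        # (tag, syl1, syl2) -> {syl3: count}
        self.tri = defaultdict(lambda: defaultdict(int))
        self.vocab = {EOS}
        self.tags = set()

    def train(self, tagged_morphemes):
        for morpheme, tag in tagged_morphemes:
            self.tags.add(tag)
            self.vocab.update(morpheme)  # each Hangul character is one syllable
            syls = [BOS, BOS] + list(morpheme) + [EOS]
            for a, b, c in zip(syls, syls[1:], syls[2:]):
                self.tri[(tag, a, b)][c] += 1

    def log_prob(self, morpheme, tag):
        """Smoothed log P(syllable sequence | tag) under the tri-gram model."""
        syls = [BOS, BOS] + list(morpheme) + [EOS]
        v = len(self.vocab) + 1  # one extra slot reserves mass for unseen syllables
        total = 0.0
        for a, b, c in zip(syls, syls[1:], syls[2:]):
            ctx = self.tri.get((tag, a, b), {})
            num = ctx.get(c, 0) + self.alpha
            den = sum(ctx.values()) + self.alpha * v
            total += math.log(num / den)
        return total

    def guess(self, morpheme):
        """Return the tag whose tri-gram model scores the morpheme highest."""
        return max(sorted(self.tags), key=lambda t: self.log_prob(morpheme, t))


# Toy lexicon (illustrative only, not taken from the paper's corpora).
TOY_LEXICON = [
    ("학교", "NOUN"), ("학생", "NOUN"), ("학원", "NOUN"),
    ("먹", "VERB"), ("막", "VERB"), ("달리", "VERB"),
]

model = SyllableTrigramGuesser()
model.train(TOY_LEXICON)
print(model.guess("학자"))  # "학자" does not occur in the toy lexicon
```

In a full tagger, scores of this kind would stand in for the lexical probabilities of unknown morphemes during the search for the best tag sequence; here they are only compared across tags for a single morpheme.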
Similar resources
Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of Korean
Most errors in Korean morphological analysis and part-of-speech (POS) tagging are caused by unknown morphemes. This paper presents a syllable-pattern-based generalized unknown-morpheme-estimation method with POSTAG (POStech TAGger), which is a statistical and rule-based hybrid POS tagging system. This method of guessing unknown morphemes is based on a combination of a morpheme pattern dictionary...
Hybrid POS tagging with generalized unknown-word handling
This paper presents POSTAG, a statistical/rule-based hybrid part-of-speech (POS) tagging system with generalized unknown-word handling. POSTAG integrates morphological analysis with statistical POS disambiguation and rule-based post error-correction. The error-correction rules are automatically learned from a tagged corpus and selectively correct standard HMM tagging errors. The morpho...
Multilingual Word Segmentation and Part-of-Speech Tagging: A Machine Learning Approach Incorporating Diverse Features
The aim of this dissertation is to study statistical methods for multilingual word segmentation and POS tagging with high accuracy. Word segmentation and part-of-speech (POS) tagging are fundamental language analysis tasks in natural language processing and are used in many applications. The existence of unknown words is a major problem in these tasks, and such words need to be handled properly. We attempt t...
Chinese POS Disambiguation and Unknown Word Guessing with Lexicalized HMMs
This article presents a lexicalized HMM-based approach to Chinese part-of-speech (POS) disambiguation and unknown word guessing (UWG). In order to explore word-internal morphological features for Chinese POS tagging, four types of pattern tags are defined to indicate the way lexicon words are used in a segmented sentence. Such patterns are combined further with POS tags. Thus, Chinese POS disam...
Unsupervised Morphology Induction for Part-of-Speech Tagging
In this paper we present an unsupervised morphology induction algorithm that uses Alignment Based Learning (ABL; e.g., Zaanen, 2001) for hypothesis generation. We show how this algorithm can be used to induce a lexicon and morphological rules for a wide range of natural languages. The resulting morphological rules and structures are optimized during the induction process using a constraint sat...